A putative scenario of how de novo protein-coding genes originate in the Saccharomyces cerevisiae lineage

Background Novel protein-coding genes were considered to be born by re-organization of pre-existing genes, such as gene duplication and gene fusion. However, recent progress of genome research revealed that more protein-coding genes than expected were born de novo, that is, gene origination by accumulating mutations in non-genic DNA sequences. Nonetheless, the in-depth process (scenario) for de novo origination is not well understood. Results We have conceived bioinformatic analysis for sketching a scenario for de novo origination of protein-coding genes. For each de novo protein-coding gene, we firstly identified an edge of a given phylogenetic tree where the gene was born based on parsimony. Then, from a multiple sequence alignment of the de novo gene and its orthologous regions, we constructed ancestral DNA sequences of the gene corresponding to both end nodes of the edge. We finally revealed statistical features observed in evolution between the two ancestral sequences. In the analysis of the Saccharomyces cerevisiae lineage, we have successfully sketched a putative scenario for de novo origination of protein-coding genes. (1) In the beginning was GC-rich genome regions. (2) Neutral mutations were accumulated in the regions. (3) ORFs were extended/combined, and then (4) translation signature (Kozak consensus sequence) was recruited. Interestingly, as the scenario progresses from (2) to (4), the specificity of mutations increases. Conclusion To the best of our knowledge, this is the first report outlining a scenario of de novo origination of protein-coding genes. Our bioinformatic analysis can capture events that occur during a short evolutionary time by directly observing the evolution of the ancestral sequences from non-genic to genic. This property is suitable for the analysis of fast evolving de novo genes.


Introduction
Almost all protein-coding genes were born by re-organization of pre-existing genes, such as gene duplication and gene fusion, and it has been considered that few of them were born de novo, that is, gene origination by accumulating mutations in non-genic DNA sequences [1].However, it has become clear that more protein-coding genes than expected were born de novo, as comprehensive data, such as RNA-seq and ribosome profiling, have been released [2][3][4][5][6][7].Moreover, their biological functions have been identified in some cases [8].Nonetheless, the indepth process (scenario) for the de novo origination is not well understood.
A comprehensive catalogue of de novo protein-coding genes in the Saccharomyces cerevisiae genome has been compiled by Carvunis et al. [2].The catalogue consists of 777 annotated and 1,139 unannotated ORFs.The 777 annotated ORFs were identified based on homology search against protein and nucleotide sequences of the 14 other yeast species up to Schizosaccharomyces pombe.According to their conservation levels over the Ascomycota phylogeny, they were classified into four groups S 1 ∼ S 4 , that is, ORF S 1 is specific to S. cerevisiae, while ORF S 2 , ORF S 3 and ORF S 4 are conserved from S. cerevisiae to Saccharomyces paradoxus, Saccharomyces mikatae and Saccharomyces bayanus, respectively.The 1,139 unannotated ORFs, which is denoted by ORF + S 0 , were identified based on transcription and translation evidence, that is, RNA-seq and ribosome profiling data in rich and starvation conditions.They are unannotated ORFs specific to S. cerevisiae, longer than 30 nucleotides and free from overlap with annotated ORFs on the same strand.Note that ORF + S 0 s would exhibit features intermediate between genes and non-genes.A considerable number of them might lose the ability to be transcribed and translated.However, it is notable that de novo protein-coding genes are expected to gradually emerge from them.

Approach
Inspired by the above data, we have conceived bioinformatic analysis for sketching a scenario for de novo origination of protein-coding genes.Our analysis can be applied when the following conditions are met: (a) a number of de novo protein-coding genes in a species are catalogued, (b) their conservation levels over phylogeny are identified, (c) a phylogenetic tree of the closely relating species is given, and (d) their genome sequences are available.The data above meets the conditions.
Figure 1 shows the schematic diagram of our bioinformatic analysis.For each de novo protein-coding gene catalogued, we firstly identified an edge of a given phylogenetic tree where the gene was born based on parsimony.We secondly constructed the most likely ancestral DNA sequences of the gene corresponding to both end nodes of the edge from a multiple sequence alignment of the de novo gene and its orthologous regions.Note that ancestral sequences of the both parent and child nodes are non-genic and genic, respectively.We thirdly revealed statistical features observed in evolution from the parent ancestral to the child ancestral sequences.Our bioinformatic analysis enables us to capture statistical features of DNA sequences observed in evolution from non-genic to genic sequences.Based on the statistical features observed, we finally sketched a putative scenario for de novo origination of protein-coding genes.

Data
We list below a set of data which we used for the bioinformatic analysis.Genome sequences of Saccharomyces sensu stricto species, that is, S. cerevisiae, S. paradoxus, S. mikatae and S. bayanus, were obtained from Saccharomyces Genome Resequencing Project (SGRP) and Saccharomyces Genome Database (SGD).See Table 1 for the detailed information of the data.A well-studied phylogenetic tree of S. sensu stricto species was given by Kellis et al. [9].A comprehensive catalogue of de novo protein-coding genes of S. cerevisiae was compiled by Carvunis et al. [2].In this catalogue, genomic locations of de novo protein-coding genes and their conservation levels over the phylogeny are shown.The detailed contents of the catalogue are described in the last paragraph of 'Introduction' .

Bioinformatic analysis
According to genomic locations of de novo proteincoding genes shown in the catalogue, we extracted their nucleotide sequences from the S. cerevisiae genome.Each extracted sequence includes flanking regions which are 500 nt sequence long from both ends of a gene.Since statistics calculated from short de novo protein-coding genes are unreliable, we discarded ones whose ORF lengths are < 60 nt.The number of de novo protein-coding genes Fig. 1 Schematic diagram of our bioinformatic analysis.This shows the analysis of a de novo protein-coding gene ORF S2 of S. cerevisiae, which is conserved in S. cerevisiae and S. paradoxus.According to parsimony, we can consider that the gene was born on the edge ( ⋆ ) of a given phylogenetic tree.Then, we can construct the most likely ancestral DNA sequences of the gene corresponding to both end nodes (A 1 and A 2 ) of the edge from a multiple sequence alignment of the gene and its orthologous regions.Note that an ancestral sequence of A 1 is genic, while that of A 2 is non-genic.We then reveal statistical features observed in evolution from the ancestral sequence of A 2 to that of A 1 , and finally sketched a putative scenario for de novo origination of protein-coding genes based on the statistical features observed discarded was 692, which consist of 691 ORF + S 0 s and 1 ORF S 1 .In addition, to calculate background statistics, we randomly extracted 6,006 intergene ORFs (below, refer as intergenes), whose sequence lengths are ≥ 60 nt and genomic regions do not overlap with RNA-seq data in starvation and rich conditions [2], from the S. cerevisiae genome.Each extracted sequence also includes flanking regions which are 500 nt sequence long from both ends of an intergene ORF.While a large number of the intergene ORFs are expected to be non-genic, a small number of genic ones could still remain.
For each extracted sequence of the de novo proteincoding genes, we identified its orthologous sequences in S. paradoxus, S. mikatae and S. bayanus genomes.That is, we aligned it to each of the three genome sequences using glsearch [10], whose alignments are global in the query and local in the database sequences (global-local alignments), and collected sequences given by the best alignments for respective genomes.Since the flanking regions in an extracted sequence contain parts of protein-coding sequences of the neighboring genes in S. cerevisiae genome in most cases [11], this global-local alignment can be regarded as a synteny-based method.We regarded a set of extracted and collected sequences as orthologue.
A set of orthologous sequences was then aligned using MAFFT [12].From the multiple sequence alignment, the most likely ancestral sequences before and after a gene was born de novo were constructed using FastML [13].To obtain reliable ancestral sequences, we constructed ancestral sequences until the most recent common ancestor of S. cerevisiae, S. paradoxus and S. mikatae, because evolutionary distance between S. cerevisiae and S. bayanus is much greater than that between S. cerevisiae and S. mikatae [14].Therefore, much of the analysis here was limited to ORF + S 0 and ORF S 1∼2 , where S. mikatae was used as an outgroup.Moreover, we discarded ancestral sequences whose ,T} p i,j were less than 0.80, where L is the length of an ancestral sequence, and p i,j is the pos- terior probability of base j at position i in the sequence.
We focused here on the following statistical features observed in ancestral sequences before and after genes were born: (i) GC content, (ii) mutations accumulated, (iii) extension and shrinkage of ORF length, and (iv) appearance and disappearance of translation signature (Kozak consensus sequence).We evaluated these statistical features based on pairwise alignments of the ancestral sequences.Since aligned bases implies homologous bases, we can identify accumulated mutations, that is, aligned mismatch bases mean substitutions, and aligned bases and gaps mean indels.Moreover, we can also identify ancestral ORFs from recent ORFs by using consistency of reading frames and their overlapping length.As for (i), GC content of a genic ORF was calculated from base contents at the third codon positions of the ORF.Because base content at the third codon positions is mostly unconstrained by functional requirements, that is, by the need to code specific amino acids, the third codon position is a natural candidate for a predictive proxy of flanking GC content [15].GC content of a non-genic ORF was calculated from base contents of entire region of the ORF.As for (ii), frequencies of substitutions and indels were calculated.Substitutions were classified into transition and transversion.Substitutions were also classified into ones concerned with GC pressure, that is, AT>CG and CG>AT, where α 1 α 2 > β 1 β 2 indicates that base α 1 or α 2 was substituted for base β 1 or β 2 .As for (iii) and (iv), we defined a start codon of an ORF by the most upstream ATG and the most upstream Kozak consensus sequence (A..ATG and A..ATG .C) [16], where '.' indicates any base.

GC content
While de novo protein-coding genes in the catalogue are located in GC-rich regions of the S. cerevisiae genome (Fig. 2a), these regions had already been GC-rich before they were born (Fig. 2b and c).This indicates that de novo protein-coding genes are preferentially born from GC-rich non-genic regions.The similar preference was reported by Vakirlis et al. [17] using independent data sets of S. sensu stricto.Although it is known that long ORFs are more likely to be observed by chance in GCrich regions, interestingly, they proposed the relationship to promoter activity intrinsic to such regions.See Table 2 for the detailed information of Fig. 2.

Mutations
Mutations that were accumulated during evolution of ORFs from non-genic to genic sequences were not significantly different from those that were accumulated in intergenes (Fig. 3), with few exceptions (* in Fig. 3a).
Since mutations that are accumulated in intergenes are expected to be neutral, mutations that were accumulated during evolution of ORFs from non-genic to genic sequences are considered to be neutral.Lu et al. obtained similar results from independent data sets of S. sensu stricto [18].However, there remains the probability that these mutations may contain a small amount of adaptive ones.See Table 3 for the detailed information of Fig. 3.
Fig. 2 a A box plot representing GC contents (%GC) of de novo protein-coding genes (ORF + S 0 and ORF S1∼4 ), annotated ORFs and intergenes in the S. cerevisiae genome.%GC of de novo protein-coding genes are higher than that of intergenes and are comparable with that of annotated ORFs.Statistical differences between mean %GC of intergene and each of the others were evaluated by Welch's t-test, and p-values were ≤ 5.2 × 10 −14 (marked by ***).b A box plot representing %GC of de novo protein-coding genes (ORF + S 0 , ORF S1 and ORF S2 ) and intergene in the ancestral genome of A 1 (see Fig. 1 for A 1 ).%GC of de novo protein-coding genes are higher than that of intergene.Note that ORF + S 0 and ORF S1 at A 1 are non-genic.Statistical differences between mean %GC of intergene and each of the others were evaluated by Welch's t-test, and p-values were ≤ 4.8 × 10 −6 (marked by ***).c A box plot representing %GC of de novo protein-coding genes (ORF + S 0 , ORF S1 and ORF S2 ) and intergene in the ancestral genome of A 2 (see Fig. 1 for A 2 ).%GC of de novo protein-coding genes are higher than that of intergene.Note that ORF + S 0 and ORF S1∼2 at A 2 are non-genic.Statistical differences between mean %GC of intergene and each of the others were evaluated by Welch's t-test, and p-values were ≤ 1.1 × 10 −11 (marked by ***)

ORF length
Although no significant trend in length changes of ORF + S 0 was observed during the evolutionary time from A 1 to S. cerevisiae (Table 4a), significant trend in length extension of ORF S 1 was observed during this evolutionary time (Table 4b), where A 1 is the common ancestor of S. cerevisiae and S. paradoxus (Fig. 1).This difference is due to the difference in sequence length between ORF S 1 and ORF + S 0 .That is, the sequence length of ORF S 1 s, which are annotated ORFs, is long, and that of ORF + S 0 s, which are unannotated ORFs, is short, while ORF length of their orthologous regions in A 1 is short in both.Interestingly, while length changes of ORF + S 0 are small, those of ORF S 1 are positively large (Fig. 4).These indicate that ORF + S 0 was formed by repeating its extension and shrinkage by frameshift and nonsense mutations, and ORF S 1 were formed by combining other ORFs into itself by frameshift mutations.

Translation signature
Although ORF + S 0 and ORF S 1 evolved from non-genic to genic sequences during the evolutionary time from A 1 to S. cerevisiae (see Fig. 1), no significant trend in appearance of Kozak sequences of them was observed during this evolutionary time (Table 5), where A 1 is the common ancestor of S. cerevisiae and S. paradoxus (Fig. 1).Carvunis et al., however, reported that proportion of de novo protein-coding genes with Kozak sequences increases as their conservation levels increase, that is, as they age [2].Taken together, these two results indicate that there is a time delay between the appearance of Kozak sequences and the acquisition of protein-coding properties.This is Table 2 a GC contents (%GC) of de novo protein-coding genes (ORF + S 0 and ORF S1∼4 ), annotated ORFs and intergenes in the S. cerevisiae genome.This table is the detailed information of Fig. 2a.b %GC of de novo protein-coding genes (ORF + S 0 , ORF S1 and ORF S2 ) and intergene in the ancestral genomes of A 1 and A 2 (see Fig. 1 for A 1 and A 2 ).This table is the detailed information of Fig. 2b and c  to S. cerevisiae (see Fig. 1 for A 1 ).During this evolutionary time, ORF + S 0 and ORF S1 evolved from non-genic to genic sequences.Statistical differences between mean mutation frequency of intergene and each of the others were evaluated by Welch's t-test, and significant difference was observed only in transition frequency between intergene and ORF S1 (marked by *).b A box plot representing frequencies of mutations (transition, transversion, . . ., and indel) accumulated in ORF S2 and intergene during the evolutionary time from A 2 to A 1 (see Fig. 1 for A 1 and A 2 ).During this evolutionary time, ORF S2 evolved from non-genic to genic sequences.Statistical differences between mean mutation frequency of intergene and each of the others were evaluated by Welch's t-test, and no significant differences were observed in contrast to the simultaneous occurrence of the extension of ORF lengths and the acquisition of protein-coding properties.These may be due to the higher specificities of positions and bases of neutral mutations by which Kozak sequences appear compared to neutral mutations by which ORFs are extended.That is, the former mutations are ones to specific bases at specific positions surrounding start codons, while the latter mutations are ones with low base specificities in stop codons and overlapping regions of ORFs.

A putative scenario for de novo gene origination
To summarize all of the above, we found that (i) de novo protein-coding genes tend to originate from GC-rich regions, (ii) mutations that evolve ORFs from non-genic to genic sequences are neutral, and (iii) these mutations frequently extend/combine ORFs, (iv) followed by recruitment of translation signature (Kozak sequence).These findings naturally lead us to a putative a scenario of de novo origination of protein-coding genes.(1) In the beginning was GC-rich genome regions.(2) Neutral mutations were accumulated in the regions.(3) ORFs were extended/combined, and then (4) translation signature (Kozak sequence) was recruited.Interestingly, as the scenario progresses from (2) to (4), the specificity of neutral mutations increases (see previous subsection).That is, the order of events in the scenario is governed by ratio of number of neutral mutations being capable of causing each event to total number of all possible neutral mutations, and events occur in descending order of this ratio.
To the best of our knowledge, this is the first report outlining a scenario of de novo origination of proteincoding genes.As we mentioned the above, some of the statistical features in the scenario have been reported by independent researches [17,18].However, since each of them uses distinct data, there is insufficient evidence for outlining a single scenario from them.We applied here a systematic analysis to comprehensive data compiled by a single research group, then there is Table 3 a Frequencies of mutations (transition, transversion, . . ., and indel) accumulated in ORF + S 0 , ORF S1 and intergene during the evolutionary time from A 1 to S. cerevisiae (see Fig. 1 for A 1 ).Average (Ave) and standard deviation (SD) of the frequencies are shown.During this evolutionary time, ORF + S 0 and ORF S1 evolved from non-genic to genic sequences.Numbers in parentheses are the numbers of sequences.This table is the detailed information of Fig. 3a.b Frequencies of mutations (transition, transversion, . . ., and indel) accumulated in ORF S2 and intergene during the evolutionary time from A 2 to A 1 (see Fig. 1 for A 1 and A 2 ).Average (Ave) and standard deviation (SD) of the frequencies are shown.During this evolutionary time, ORF S2 evolved from non-genic to genic sequences.Numbers in parentheses are the numbers of sequences.This table is the detailed of Fig. 3b a sufficient evidence for outlining a single scenario from statistical features observed.On the other hand, we should note that our scenario was derived from a de novo protein-coding gene catalogue containing ORF + S 0 s which are free from overlap with annotated ORFs on the same strand [2].Therefore, the ability of our scenario to explain process of de novo gene origination by overprinting and exonization [19] is limited.

Advantages of our bioinformatic analysis
Our bioinformatic analysis has successfully sketched a putative scenario for de novo origination of proteincoding genes.The main advantage of our bioinformatic analysis is that we can directly observe the evolution from non-genic to genic sequences by constructing ancestral sequences.This direct observation enables our analysis to capture events that occur during a short evolutionary time.Therefore, the order of ( 3) and (4) in the above scenario was clearly shown by our analysis, while this order was not clear in existing analyses such as cross-species comparison [2].As Klasberg et al. pointed out [20], the existing analyses can not fully capture fast evolution of de novo protein-coding genes.However, construction of ancestral sequences is fraught with ambiguity.Therefore, we adopted here conservative filters, such that we declined to construct the common ancestral sequences of S. sensu stricto (the 3rd paragraph of 'Bioinformatic analysis').
Another advantage of our analysis is that we can easily improve time resolution by incorporating closely related species data.Saccharomyces boulardii is located closer to S. cerevisiae than S. paradoxus on the S. sensu stricto phylogenetic tree [21].By incorporating S. boulardii data, we can observe the evolution hfrom non-genic to genic sequences that occurs in a short period of time.On the other hand, Saccharomyces kudriavzevii is located where it fills the large gap between S. mikatae and S. bayanus on the S. sensu stricto phylogenetic tree [14].By incorporating S. kudriavzevii data, we can observe the evolution from non-genic to genic sequences that occurred in older times.

An insight into 'ORF first' model and 'transcription first' model
Finally, we provide an insight into 'ORF first' model and 'transcription first' model [22] from viewpoints of our research.For de novo origination of a protein-coding gene to occur, a non-genic sequence must both be transcribed and acquire an ORF before becoming translated.These events may in theory occur in either order, and there is evidence supporting both an 'ORF first' and a 'transcription first' model.On the other hand, our research suggested that events during de novo origination of protein-coding genes tend to occur in order of which they are likely to occur due to neutral mutations accumulated.This can lead us naturally to a conjectures below.Table 4 Changes in lengths of ORF + S 0 (a) and ORF S1 (b) during the evolutionary time from A 1 to S. cerevisiae (see Fig. 1 for A 1 ).During this evolutionary time, ORF + S 0 and ORF S1 evolved from non-genic to genic sequences.Binomial test was used to detect significant trend in changes of ORF length.The background probabilities of the test were calculated from length changes in intergene ORFs during this evolutionary time (data not shown).Although no significant trend was observed for ORF + S 0 , significant trend was observed for ORF S1 (marked by *** and **).When adopting the most upstream A..ATG.C as start codons, no significant trend was observed even for ORF S1 , because of small numbers of samples In GC-rich regions, both of 'ORF first' and 'transcription first' might be observed, because they may already have relatively long ORFs and may already obtain weak transcriptional activity.In AT-rich regions, 'ORF first' might be mainly observed, because a few number of neutral mutations can extend ORF length, but cannot elevate GC content.Although they can also result in appearance of transcription factor binding sites, it is expected that its effect on de novo origination of protein-coding genes would be small compared to that of ORF extension.
Fig. 4 Changes in lengths of ORF + S 0 (red) and ORF S1 (blue) during the evolutionary time from A 1 to S. cerevisiae (see Fig. 1 for A 1 ).During this evolutionary time, ORF + S 0 and ORF S1 evolved from non-genic to genic sequences.a The most upstream ATG was used as a start codon of an ORF.b The most upstream A.. ATG was used as a start codon of an ORF.c The most upstream A..ATG.C was used as a start codon of an ORF.These figures are the detailed information of Table 4 Table 5 Appearance and disappearance of Kozak sequences A..ATG (a) and A..ATG.C (b) during the evolutionary time from A 1 to S. cerevisiae (see Fig. 1 for A 1 ).During this evolutionary time, ORF + S 0 and ORF S1 evolved from non-genic to genic sequences.While binomial test was used to detect significant trend in appearance and disappearance of Kozak sequence, no significant trend was observed.The background probabilities of the test were calculated from those of intergene ORFs during this evolutionary time (data not shown)

Table 1
Genome sequence data.A set of genome sequence data which was used for our bioinformatic analysis is listed